Abstract

Solomonoff completed the Bayesian framework by providing a rigorous,
unique, formal, and universal choice for the model class and the prior. We discuss
in breadth how and in which sense universal (non-i.i.d.) sequence prediction
solves various (philosophical) problems of traditional Bayesian sequence
prediction. We show that Solomonoff's model possesses many desirable properties:
fast convergence and strong bounds, and in contrast to most classical
continuous prior densities has no zero p(oste)rior problem, i.e. can confirm universal
hypotheses, is reparametrization and regrouping invariant, and avoids
the old-evidence and updating problem. It even performs well (actually better)
in non-computable environments.
1 Introduction



Examples and goal. Given the weather in the past, what is the probability 
of rain tomorrow? What is the correct answer in an IQ test asking to continue 
the sequence 1,4,9,16,? Given historic stock-charts, can one predict the quotes of 
tomorrow? Assuming the sun rose every day for 5000 years, how likely is doomsday (that
the sun does not rise) tomorrow? These are instances of the important problem 
of inductive inference or time-series forecasting or sequence prediction. Finding 
prediction rules for every particular (new) problem is possible but cumbersome and 
prone to disagreement or contradiction. What we are interested in is a formal general 
theory for prediction. 

Bayesian sequence prediction. The Bayesian framework is the most consistent
and successful framework developed thus far [Ear93]. A Bayesian considers a set
of environments=hypotheses=models $\mathcal{M}$ which includes the true data generating
probability distribution $\mu$. From one's prior belief $w_\nu$ in environment $\nu\in\mathcal{M}$ and the
observed data sequence $x = x_1...x_n$, Bayes' rule yields one's posterior confidence in $\nu$.
In a predictive setting, one directly determines the predictive probability of the next
symbol $x_{n+1}$ without the intermediate step of identifying a (true or good or causal
or useful) model. Note that classification and regression can be regarded as special
sequence prediction problems, where the sequence $x_1y_1...x_ny_nx_{n+1}$ of $(x,y)$-pairs is
given and the class label or function value $y_{n+1}$ shall be predicted.

Universal sequence prediction. The Bayesian framework leaves open how to
choose the model class $\mathcal{M}$ and prior $w_\nu$. General guidelines are that $\mathcal{M}$ should be
small but large enough to contain the true environment $\mu$, and $w_\nu$ should reflect one's
prior (subjective) belief in $\nu$ or should be non-informative or neutral or objective if no
prior knowledge is available. But these are informal and ambiguous considerations
outside the formal Bayesian framework. Solomonoff's [Sol64] rigorous, essentially
unique, formal, and universal solution to this problem is to consider a single large
universal class $\mathcal{M}_U$ suitable for all induction problems. The corresponding universal
prior is biased towards simple environments in such a way that it dominates (is
superior to) all other priors. This leads to an a priori probability $M(x)$ which is
equivalent to the probability that a universal Turing machine with random input
tape outputs $x$.

History and motivation. Many interesting, important, and deep results have been 
proven for Solomonoff's universal distribution $M$ [ZL70, Sol78, LV97, Hut04]. The
motivation and goal of this paper is to provide a broad discussion of how and in which 
sense universal sequence prediction solves all kinds of (philosophical) problems of 
Bayesian sequence prediction, and to present some recent results. Many arguments 
and ideas could be further developed. I hope that the exposition stimulates such a 
future, more detailed, investigation. 

Contents. In Section 2 we review the excellent predictive performance of Bayesian
sequence prediction for generic (non-i.i.d.) countable and continuous model classes.
Section 3 critically reviews the classical principles (indifference, symmetry, minimax)
for obtaining objective priors, and introduces the universal prior inspired by Occam's
razor and quantified in terms of Kolmogorov complexity. In Section 4 (for i.i.d. $\mathcal{M}$)
and Section 5 (for universal $\mathcal{M}_U$) we show various desirable properties of the
universal prior and class (non-zero p(oste)rior, confirmation of universal hypotheses,
reparametrization and regrouping invariance, no old-evidence and updating problem)
in contrast to (most) classical continuous prior densities. Finally, we show that
the universal mixture performs better than classical continuous mixtures, even in
uncomputable environments. Section 6 contains critique and summary.



2 Bayesian Sequence Prediction 

Notation. We use letters $t,n\in\mathbb{N}$ for natural numbers, and denote the cardinality
of a set $S$ by $\#S$ or $|S|$. We write $\mathcal{X}^*$ for the set of finite strings over some alphabet
$\mathcal{X}$, and $\mathcal{X}^\infty$ for the set of infinite sequences. For a string $x\in\mathcal{X}^*$ of length $\ell(x) = n$
we write $x_1x_2...x_n$ with $x_t\in\mathcal{X}$, and further abbreviate $x_{t:n} := x_tx_{t+1}...x_{n-1}x_n$ and
$x_{<n} := x_1...x_{n-1}$. We assume that sequence $\omega = \omega_{1:\infty}\in\mathcal{X}^\infty$ is sampled from the "true"
probability measure $\mu$, i.e. $\mu(x_{1:n}) := P[\omega_{1:n} = x_{1:n}|\mu]$ is the $\mu$-probability that $\omega$ starts
with $x_{1:n}$. We denote expectations w.r.t. $\mu$ by $E$. In particular for a function $f : \mathcal{X}^n\to\mathbb{R}$,
we have $E[f] = E[f(\omega_{1:n})] = \sum_{x_{1:n}} \mu(x_{1:n})f(x_{1:n})$. If $\mu$ is unknown but known
to belong to a countable class of environments=models=measures $\mathcal{M} = \{\nu_1,\nu_2,...\}$,
and $\{H_\nu : \nu\in\mathcal{M}\}$ forms a mutually exclusive and complete class of hypotheses, and
$w_\nu := P[H_\nu]$ is our prior belief in $H_\nu$, then
$\xi(x_{1:n}) := P[\omega_{1:n} = x_{1:n}] = \sum_{\nu\in\mathcal{M}} P[\omega_{1:n} = x_{1:n}|H_\nu]P[H_\nu]$
must be our (prior) belief in $x_{1:n}$, and
$w_\nu(x_{1:n}) := P[H_\nu|\omega_{1:n} = x_{1:n}] = P[\omega_{1:n} = x_{1:n}|H_\nu]P[H_\nu] / P[\omega_{1:n} = x_{1:n}]$
be our posterior belief in $\nu$ by Bayes' rule. For a sequence $a_1,a_2,...$
of random variables, $\sum_{t=1}^\infty E[a_t^2] \le c < \infty$ implies $a_t\to 0$ with $\mu$-probability 1 (w.p.1).
Convergence is rapid in the sense that the probability that $a_t^2$ exceeds $\varepsilon > 0$ at more
than $c/(\varepsilon\delta)$ times $t$ is bounded by $\delta$. We sometimes loosely call this the number of
errors.

Sequence prediction. Given a sequence $x_1x_2...x_{t-1}$, we want to predict its likely
continuation $x_t$. We assume that the strings which have to be continued are drawn
from a "true" probability distribution $\mu$. The maximal prior information a prediction
algorithm can possess is the exact knowledge of $\mu$, but often the true distribution is
unknown. Instead, prediction is based on a guess $\rho$ of $\mu$. While we require $\mu$ to be a
measure, we allow $\rho$ to be a semimeasure [LV97, Hut04].$^1$ Formally, $\rho : \mathcal{X}^*\to[0,1]$ is
a semimeasure if $\rho(x) \ge \sum_{a\in\mathcal{X}} \rho(xa)$ $\forall x\in\mathcal{X}^*$, and a (probability) measure if equality
holds and $\rho(\epsilon) = 1$, where $\epsilon$ is the empty string. $\rho(x)$ denotes the $\rho$-probability that
a sequence starts with string $x$. Further, $\rho(a|x) := \rho(xa)/\rho(x)$ is the "posterior" or
"predictive" $\rho$-probability that the next symbol is $a\in\mathcal{X}$, given sequence $x\in\mathcal{X}^*$.

Bayes mixture. We may know or assume that $\mu$ belongs to some countable class
$\mathcal{M} := \{\nu_1,\nu_2,...\} \ni \mu$ of semimeasures. Then we can use the weighted average on $\mathcal{M}$
(Bayes-mixture, data evidence, marginal)

$\xi(x) := \sum_{\nu\in\mathcal{M}} w_\nu\cdot\nu(x), \quad \sum_{\nu\in\mathcal{M}} w_\nu \le 1, \quad w_\nu > 0$ (1)

for prediction. The most important property of semimeasure $\xi$ is its dominance

$\xi(x) \ge w_\nu\nu(x)$ $\forall x$ and $\forall\nu\in\mathcal{M}$, in particular $\xi(x) \ge w_\mu\mu(x)$ (2)

which is a strong form of absolute continuity.

$^1$ Readers unfamiliar or uneasy with semimeasures can without loss ignore this technicality.
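
The mixture (1) and its dominance property (2) can be checked numerically on a toy class. The following is only a minimal sketch: the three Bernoulli environments, their weights, and the names `bernoulli`, `CLASS`, `xi`, `predict` and `dominance_holds` are illustrative assumptions, not part of the text.

```python
from itertools import product

def bernoulli(theta):
    """Return the measure nu_theta on binary strings: nu(x) = theta^n1 * (1-theta)^n0."""
    def nu(x):
        n1 = sum(x)
        return theta ** n1 * (1 - theta) ** (len(x) - n1)
    return nu

# Hypothetical finite model class: pairs (w_nu, nu) with weights summing to 1.
CLASS = [(0.5, bernoulli(0.2)), (0.3, bernoulli(0.5)), (0.2, bernoulli(0.9))]

def xi(x):
    """Bayes mixture (1): xi(x) = sum over nu of w_nu * nu(x)."""
    return sum(w * nu(x) for w, nu in CLASS)

def predict(a, x):
    """Predictive probability xi(a|x) = xi(xa) / xi(x)."""
    return xi(tuple(x) + (a,)) / xi(tuple(x))

def dominance_holds(n):
    """Check (2): xi(x) >= w_nu * nu(x) for all strings x of length n and all nu."""
    return all(xi(x) >= w * nu(x)
               for x in product((0, 1), repeat=n) for w, nu in CLASS)

if __name__ == "__main__":
    print(dominance_holds(6))
    print(predict(1, (1, 1, 1, 1)))
```

Dominance holds here simply because every term of the sum in (1) is non-negative, which is the "dropping the sum" argument in miniature.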

Convergence for deterministic environments. In the predictive setting we are
not interested in identifying the true environment, but to predict the next symbol
well. Let us consider deterministic $\mu$ first. An environment is called deterministic if
$\mu(\alpha_{1:n}) = 1$ $\forall n$ for some sequence $\alpha$, and $\mu = 0$ elsewhere (off-sequence). In this case
we identify $\mu$ with $\alpha$ and the following holds:

$\sum_{t=1}^\infty |1-\xi(\alpha_t|\alpha_{<t})| \le \ln w_\alpha^{-1}$ and $\xi(\alpha_{t:n}|\alpha_{<t}) \to 1$ for $n \ge t \to \infty$ (3)

where $w_\alpha > 0$ is the weight of $\alpha \,\hat{=}\, \mu\in\mathcal{M}$. This shows that $\xi(\alpha_t|\alpha_{<t})$ rapidly converges
to 1 and hence also $\xi(\bar\alpha_t|\alpha_{<t}) \to 0$ for $\bar\alpha_t \ne \alpha_t$, and that $\xi$ is also a good multi-step
lookahead predictor. Proof: $\xi(\alpha_{1:n}) \to c > 0$, since $\xi(\alpha_{1:n})$ is monotone decreasing
in $n$ and $\xi(\alpha_{1:n}) \ge w_\mu\mu(\alpha_{1:n}) = w_\mu > 0$. Hence $\xi(\alpha_{1:n})/\xi(\alpha_{<t}) \to c/c = 1$ for any limit
sequence $t,n\to\infty$. The bound follows from
$\sum_{t=1}^n 1-\xi(x_t|x_{<t}) \le -\sum_{t=1}^n \ln\xi(x_t|x_{<t}) = -\ln\xi(x_{1:n})$ and $\xi(\alpha_{1:n}) \ge w_\alpha$.
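
The bound in (3) can be illustrated on a small deterministic class. This is a sketch under stated assumptions: the class of eight periodic sequences, the weights $2^{-k}$, and the names `symbol`, `cumulative_error` and `bound` are hypothetical choices made only for this example.

```python
import math

K_MAX = 8
# Hypothetical prior weights w_k = 2^-k; they sum to less than 1, as (1) allows.
WEIGHTS = {k: 2.0 ** -k for k in range(1, K_MAX + 1)}

def symbol(k, t):
    """Symbol emitted at time t = 1, 2, ... by the k-th deterministic environment:
    a 1 at every multiple of k, otherwise 0."""
    return 1 if t % k == 0 else 0

def cumulative_error(true_k, n):
    """Return sum_{t=1}^n (1 - xi(alpha_t | alpha_<t)) along the true sequence alpha."""
    alive = dict(WEIGHTS)              # environments still consistent with alpha_<t
    total = 0.0
    for t in range(1, n + 1):
        xi_prev = sum(alive.values())  # xi(alpha_<t): weight of consistent environments
        alive = {k: w for k, w in alive.items()
                 if symbol(k, t) == symbol(true_k, t)}
        total += 1.0 - sum(alive.values()) / xi_prev
    return total

def bound(true_k):
    """Right-hand side of (3): ln(1 / w_alpha)."""
    return math.log(1.0 / WEIGHTS[true_k])

if __name__ == "__main__":
    for k in WEIGHTS:
        print(k, round(cumulative_error(k, 60), 4), round(bound(k), 4))
```

For a deterministic environment $\nu(x)$ is 1 on prefixes of its sequence and 0 elsewhere, so $\xi(\alpha_{1:t})$ is just the total weight of the environments not yet contradicted, which is what `alive` tracks.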

Convergence in probabilistic environments. In the general probabilistic case
we want to know how close $\xi_t := \xi(\cdot|\omega_{<t}) \in \mathbb{R}^{|\mathcal{X}|}$ is to the true probability $\mu_t :=
\mu(\cdot|\omega_{<t})$. One can show that

$\sum_{t=1}^n E[s_t] \le D_n(\mu\|\xi) := E[\ln\frac{\mu(\omega_{1:n})}{\xi(\omega_{1:n})}] \le \ln w_\mu^{-1}$, (4)

where $s_t = s_t(\mu_t,\xi_t)$ can be the squared Euclidian or Hellinger or absolute or KL
distance between $\mu_t$ and $\xi_t$, or the squared Bayes-regret [Hut04]. The first inequality
actually holds for any two (semi)measures, and the last inequality follows from (2).
These bounds (with $n = \infty$) imply

$\xi(x_t|\omega_{<t}) - \mu(x_t|\omega_{<t}) \to 0$ for any $x_t$ rapid w.p.1 for $t\to\infty$.

One can also show multi-step lookahead convergence $\xi(x_{t:n_t}|\omega_{<t}) - \mu(x_{t:n_t}|\omega_{<t}) \to 0$,
(even for unbounded horizon $1 \le n_t - t + 1 \to \infty$) which is interesting for delayed
sequence prediction and in reactive environments [Hut04].
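
Both inequalities in (4) can be verified exactly for short horizons by enumerating all binary strings. The sketch below assumes a hypothetical three-element Bernoulli class and takes $s_t$ to be the squared Euclidean distance; the names `THETAS`, `WEIGHTS`, `TRUE`, `total_expected_loss` and `relative_entropy` are illustrative only.

```python
import math
from itertools import product

THETAS = [0.2, 0.5, 0.9]     # hypothetical Bernoulli class
WEIGHTS = [0.5, 0.3, 0.2]    # prior weights w_nu
TRUE = 2                     # index of the true environment mu in THETAS

def nu(theta, x):
    """Bernoulli likelihood of the binary tuple x."""
    n1 = sum(x)
    return theta ** n1 * (1 - theta) ** (len(x) - n1)

def xi(x):
    """Bayes mixture over the class."""
    return sum(w * nu(th, x) for w, th in zip(WEIGHTS, THETAS))

def total_expected_loss(n):
    """sum_{t=1}^n E[s_t], with s_t the squared Euclidean distance between the
    predictive distributions mu_t and xi_t, computed exactly by enumeration."""
    mu_theta = THETAS[TRUE]
    total = 0.0
    for t in range(1, n + 1):
        for past in product((0, 1), repeat=t - 1):
            p_xi1 = xi(past + (1,)) / xi(past)       # xi(1 | past)
            # For a binary alphabet both symbols contribute the same squared gap.
            s_t = 2 * (mu_theta - p_xi1) ** 2
            total += nu(mu_theta, past) * s_t
    return total

def relative_entropy(n):
    """D_n(mu||xi) = E[ln mu(omega_1:n) / xi(omega_1:n)], computed exactly."""
    mu_theta = THETAS[TRUE]
    return sum(nu(mu_theta, x) * math.log(nu(mu_theta, x) / xi(x))
               for x in product((0, 1), repeat=n))

if __name__ == "__main__":
    n = 10
    print(total_expected_loss(n), relative_entropy(n), math.log(1 / WEIGHTS[TRUE]))
```

The three printed numbers should appear in non-decreasing order, matching the chain of inequalities in (4).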

Continuous environmental classes. The bounds above remain approximately
valid for most parametric model classes. Let $\mathcal{M} := \{\nu_\theta : \theta\in\Theta\subseteq\mathbb{R}^d\}$ be a family of
probability distributions parameterized by a $d$-dimensional continuous parameter $\theta$,
and $\mu \equiv \nu_{\theta_0}\in\mathcal{M}$ the true generating distribution. For a continuous weight density$^2$
$w(\theta) > 0$ the sums (1) are naturally replaced by integrals:

$\xi(x) := \int_\Theta w(\theta)\cdot\nu_\theta(x)\,d\theta, \quad \int_\Theta w(\theta)\,d\theta = 1$ (5)

The most important property of $\xi$ was the dominance (2) achieved by dropping the
sum over $\nu$. The analogous construction here is to restrict the integral over $\theta$ to
a small vicinity of $\theta_0$. Since a continuous parameter can typically be estimated to
accuracy $\propto n^{-1/2}$ after $n$ observations, the largest volume in which $\nu_\theta$ as a function
of $\theta$ is approximately flat is $\propto (n^{-1/2})^d$, hence $\xi(x_{1:n}) \gtrsim n^{-d/2}w(\theta_0)\mu(x_{1:n})$. Under
some weak regularity conditions one can prove [CB90, Hut04]

$D_n(\mu\|\xi) := E\ln\frac{\mu(\omega_{1:n})}{\xi(\omega_{1:n})} \le \ln w(\theta_0)^{-1} + \frac{d}{2}\ln\frac{n}{2\pi} + \frac{1}{2}\ln\det\bar{j}_n(\theta_0) + o(1)$ (6)

where $w(\theta_0)$ is the weight density (5) of $\mu$ in $\xi$, and $o(1)$ tends to zero for $n\to\infty$,
and the average Fisher information matrix $\bar{j}_n(\theta) = -\frac{1}{n}E[\nabla_\theta\nabla_\theta^T\ln\nu_\theta(\omega_{1:n})]$ measures
the local smoothness of $\nu_\theta$ and is bounded for many reasonable classes, including all
stationary ($k$-th order) finite-state Markov processes. We see that in the continuous
case, $D_n$ is no longer bounded by a constant, but grows very slowly (logarithmically)
with $n$, which still implies that $\varepsilon$-deviations are exponentially seldom. Hence, (6)
allows one to bound (4) even in case of continuous $\mathcal{M}$.
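
For the Bernoulli class with uniform prior density, the relative entropy in (6) can be computed exactly and compared with the right-hand side. This is a sketch assuming $d = 1$, $w(\theta_0) = 1$ and the Bernoulli Fisher information $1/\theta + 1/(1-\theta)$; the function names are illustrative, and the $o(1)$ term is simply dropped.

```python
import math

def log_xi_uniform(n1, n0):
    """ln xi(x) for the uniform prior density: xi(x) = n1! n0! / (n+1)!."""
    return math.lgamma(n1 + 1) + math.lgamma(n0 + 1) - math.lgamma(n1 + n0 + 2)

def relative_entropy(theta0, n):
    """D_n(mu||xi) for Bernoulli(theta0) data; both mu(x) and xi(x) depend on x
    only through the number n1 of ones, so we sum over n1 with binomial weights."""
    d = 0.0
    for n1 in range(n + 1):
        n0 = n - n1
        log_mu = n1 * math.log(theta0) + n0 * math.log(1 - theta0)
        log_binom = math.lgamma(n + 1) - math.lgamma(n1 + 1) - math.lgamma(n0 + 1)
        d += math.exp(log_binom + log_mu) * (log_mu - log_xi_uniform(n1, n0))
    return d

def bound_6(theta0, n):
    """Right-hand side of (6) without the o(1) term, for d = 1 and w(theta0) = 1."""
    fisher = 1 / theta0 + 1 / (1 - theta0)
    return 0.5 * math.log(n / (2 * math.pi)) + 0.5 * math.log(fisher)

if __name__ == "__main__":
    for n in (50, 200, 1000):
        print(n, relative_entropy(0.3, n), bound_6(0.3, n))
```

The printed values grow roughly like $\frac{1}{2}\ln n$, illustrating the logarithmic growth of $D_n$ in the continuous case.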

3 How to Choose the Prior 

Classical principles. The probability axioms (implying Bayes' rule) allow one to compute
posteriors and predictive distributions from prior ones, but are mute about
how to choose the prior. Much has been written on the choice of
non-informative=neutral=objective priors (see [KW96] for a survey and references; how to
incorporate subjective prior knowledge is briefly discussed in a later section). For finite $\mathcal{M}$,
Laplace's symmetry or indifference argument which sets $w_\nu = \frac{1}{|\mathcal{M}|}$ $\forall\nu\in\mathcal{M}$ is a reasonable
principle. The analogue uniform density $w(\theta) = [\mathrm{Vol}(\Theta)]^{-1}$ for a compact
measurable parameter space $\Theta$ is less convincing, since $w$ becomes non-uniform
under different parametrization (e.g. $\theta \leadsto \theta' := \sqrt{\theta}$). Jeffreys' solution is to find a
symmetry group of the problem (like permutations for finite $\mathcal{M}$ or translations for
$\Theta = \mathbb{R}$) and require the prior to be invariant under group transformations. Another
solution is the minimax approach by Bernardo [CB90] which minimizes (the quite
tight) bound (6) for the worst $\mu\in\mathcal{M}$. Choice $w(\theta) \propto \sqrt{\det\bar{j}_n(\theta)}$ equalizes and hence
minimizes (6). Problems are that there may be no obvious symmetry, the resulting
prior can be improper, depend on which parameters are treated as nuisance parameters,
on the model class, and on $n$. Other principles are maximum entropy and
conjugate priors. The principles above, although not unproblematic, can provide
good objective priors in many cases of small discrete or compact spaces, but we will
meet some more problems later. For "large" model classes we are interested in, i.e.
countably infinite, non-compact, or non-parametric spaces, the principles typically
do not apply or break down.

$^2$ $w(\cdot)$ will always denote densities, and $w_{(\cdot)}$ probabilities.

Occam's razor et al. Machine learning, the computer science branch of statistics,
often deals with very large model classes. Naturally, machine learning has
(re)discovered and exploited quite different principles for choosing priors, appropriate
for this situation. The overarching principles put together by Solomonoff [Sol64]
are: Occam's razor (choose the simplest model consistent with the data), Epicurus'
principle of multiple explanations (keep all explanations consistent with the data),
(Universal) Turing machines (to compute, quantify and assign codes to all quantities
of interest), and Kolmogorov complexity (to define what simplicity/complexity
means).

We will first "derive" the so-called universal prior, and subsequently justify it by
presenting various welcome theoretical properties and by examples. The idea is that
a priori, i.e. before seeing the data, all models are "consistent," so a priori Epicurus
would regard all models (in $\mathcal{M}$) possible, i.e. choose $w_\nu > 0$ $\forall\nu\in\mathcal{M}$. In order to also
do (some) justice to Occam's razor we should prefer simple hypotheses, i.e. assign
high/low prior $w_\nu$ to simple/complex hypotheses $H_\nu$. Before we can define this
prior, we need to quantify the notion of complexity.

Notation. A function $f : S\to\mathbb{R}\cup\{\pm\infty\}$ is said to be lower semi-computable (or
enumerable) if the set $\{(x,y) : y < f(x), x\in S, y\in\mathbb{Q}\}$ is recursively enumerable. $f$
is upper semi-computable (or co-enumerable) if $-f$ is enumerable. $f$ is computable
(or recursive) if $f$ and $-f$ are enumerable. The set of (co)enumerable functions
is recursively enumerable. We write $O(1)$ for a constant of reasonable size: 100
is reasonable, maybe even $2^{30}$, but $2^{500}$ is not. We write $f(x) \stackrel{+}{\le} g(x)$ for $f(x) \le
g(x)+O(1)$ and $f(x) \stackrel{\times}{\le} g(x)$ for $f(x) \le 2^{O(1)}\cdot g(x)$. Corresponding equalities hold if
the inequalities hold in both directions.$^3$ We say that a property $A(n)\in\{true, false\}$
holds for most $n$, if $\#\{t \le n : A(t)\}/n \to 1$ for $n\to\infty$.

Kolmogorov complexity. We can now quantify the complexity of a string. Intuitively,
a string is simple if it can be described in a few words, like "the string of one
million ones", and is complex if there is no such short description, like for a random
object whose shortest description is specifying it bit by bit. We are interested in
effective descriptions, and hence restrict decoders to be Turing machines (TMs).
Let us choose some universal (so-called prefix) Turing machine $U$ with binary
input=program tape, $\mathcal{X}$-ary output tape, and bidirectional work tape. We can then
define the prefix Kolmogorov complexity [LV97] of string $x$ as the length $\ell$ of the
shortest binary program $p$ for which $U$ outputs $x$:

$K(x) := \min_p\{\ell(p) : U(p) = x\}$.

For non-string objects $o$ (like numbers and functions) we define $K(o) := K(\langle o\rangle)$,
where $\langle o\rangle\in\mathcal{X}^*$ is some standard code for $o$. In particular, if $(f_i)_{i=1}^\infty$ is an enumeration
of all (co)enumerable functions, we define $K(f_i) = K(i)$.

$^3$ We will ignore this additive/multiplicative fudge in our discussion until a later section.
An important property of $K$ is that it is nearly independent of the choice of $U$.
More precisely, if we switch from one universal TM to another, $K(x)$ changes at
most by an additive constant independent of $x$. For reasonable universal TMs, the
compiler constant is of reasonable size $O(1)$. A defining property of $K : \mathcal{X}^*\to\mathbb{N}$
is that it additively dominates all co-enumerable functions $f : \mathcal{X}^*\to\mathbb{N}$ that satisfy
Kraft's inequality $\sum_x 2^{-f(x)} \le 1$, i.e. $K(x) \stackrel{+}{\le} f(x)$ for $K(f) = O(1)$. The universal
TM provides a shorter prefix code than any other effective prefix code. $K$ shares
many properties with Shannon's entropy (information measure) $S$, but $K$ is superior
to $S$ in many respects. To be brief, $K$ is an excellent universal complexity measure,
suitable for quantifying Occam's razor. We need the following properties of $K$:

a) $K$ is not computable, but only upper semi-computable,

b) the upper bound $K(n) \stackrel{+}{\le} \log_2 n + 2\log_2\log n$, (7)

c) Kraft's inequality $\sum_x 2^{-K(x)} \le 1$, which implies $2^{-K(n)} \le \frac{1}{n}$ for most $n$,

d) information non-increase $K(f(x)) \stackrel{+}{\le} K(x) + K(f)$ for recursive $f : \mathcal{X}^*\to\mathcal{X}^*$,

e) the MDL bound $K(x) \stackrel{+}{\le} -\log_2 P(x) + K(P)$ if $P : \mathcal{X}^*\to[0,1]$ is enumerable and $\sum_x P(x) \le 1$,

f) $\sum_{x:f(x)=y} 2^{-K(x)} \stackrel{\times}{=} 2^{-K(y)}$ if $f$ is recursive and $K(f) = O(1)$.

Proofs of (a)-(e) can be found in [LV97], and the (easy) proof of (f) in the extended
version of this paper.
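
$K$ itself is not computable, but the upper bound (7b) can be made plausible with an explicit prefix code for the natural numbers. The sketch below uses the Elias delta code, which is my choice of illustration and not a construction from the text: its code words have length about $\log_2 n + 2\log_2\log_2 n$, it is prefix-free, and it therefore satisfies Kraft's inequality. Since $K$ additively dominates effective prefix codes, a code of this length supports a bound of the form (7b). All function names are hypothetical.

```python
import math

def elias_delta(n):
    """Elias delta code word (as a bit string) for an integer n >= 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    body = bin(n)[2:]              # binary expansion of n; its leading bit is 1
    length = bin(len(body))[2:]    # binary expansion of the bit length of n
    # Unary prefix announcing len(length), then the length, then n without its leading 1.
    return "0" * (len(length) - 1) + length + body[1:]

def is_prefix_free(words):
    """True if no code word is a prefix of another (adjacent check after sorting)."""
    ordered = sorted(words)
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))

def kraft_sum(words):
    """Kraft sum: sum over code words of 2^(-length)."""
    return sum(2.0 ** -len(w) for w in words)

def length_bound(n):
    """log2 n + 2 log2(log2 n + 1) + 1, an upper bound on the delta code length."""
    return math.log2(n) + 2 * math.log2(math.log2(n) + 1) + 1

if __name__ == "__main__":
    codes = [elias_delta(n) for n in range(1, 1001)]
    print(is_prefix_free(codes), kraft_sum(codes))
    print(elias_delta(2), elias_delta(4))
```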

The universal prior. We can now quantify a prior biased towards simple models.
First, we quantify the complexity of an environment $\nu$ or hypothesis $H_\nu$ by its
Kolmogorov complexity $K(\nu)$. The universal prior should be a decreasing function
in the model's complexity, and of course sum to (less than) one. Since $K$ satisfies
Kraft's inequality (7c), this suggests the following choice:

$w_\nu = w_\nu^U := 2^{-K(\nu)}$ (8)

For this choice, the bound (4) on $D_n$ reads

$\sum_{t=1}^\infty E[s_t] \le D_\infty \le K(\mu)\ln 2$ (9)

i.e. the number of times $\xi$ deviates from $\mu$ by more than $\varepsilon > 0$ is bounded by
$O(K(\mu))$, i.e. is proportional to the complexity of the environment. Could other
choices for $w_\nu$ lead to better bounds? The answer is essentially no [Hut04]: Consider
any other reasonable prior $w'_\nu$, where reasonable means (lower semi)computable
with a program of size $O(1)$. Then, MDL bound (7e) with $P(\cdot) \leadsto w'_{(\cdot)}$ and $x \leadsto \langle\mu\rangle$
shows $K(\mu) \stackrel{+}{\le} -\log_2 w'_\mu + K(w'_{(\cdot)})$, hence $\ln w'^{-1}_\mu \stackrel{+}{\ge} K(\mu)\ln 2$ leads (within an additive
constant) to a weaker bound. A counting argument also shows that $O(K(\mu))$ errors
for most $\mu$ are unavoidable. So this choice of prior leads to very good prediction.

Even for continuous classes $\mathcal{M}$, we can assign a (proper) universal prior (not
density) $w_\theta^U := 2^{-K(\theta)} > 0$ for computable $\theta$, and 0 for uncomputable ones. This
effectively reduces $\mathcal{M}$ to a discrete class $\{\nu_\theta\in\mathcal{M} : w_\theta^U > 0\}$ which is typically dense
in $\mathcal{M}$. We will see that this prior has many advantages over the classical prior
densities.
4 Independent Identically Distributed Data



Laplace's rule for Bernoulli sequences. Let $x = x_1x_2...x_n\in\mathcal{X}^n = \{0,1\}^n$ be
generated by a biased coin with head=1 probability $\theta\in[0,1]$, i.e. the likelihood
of $x$ under hypothesis $H_\theta$ is $\nu_\theta(x) = P[x|H_\theta] = \theta^{n_1}(1-\theta)^{n_0}$, where
$n_1 = x_1+...+x_n = n-n_0$.

Bayes assumed a uniform prior density $w(\theta) = 1$. The evidence is
$\xi(x) = \int_0^1 \nu_\theta(x)w(\theta)\,d\theta = \frac{n_1!n_0!}{(n+1)!}$ and the posterior probability weight density
$w(\theta|x) = \nu_\theta(x)w(\theta)/\xi(x) = \frac{(n+1)!}{n_1!n_0!}\theta^{n_1}(1-\theta)^{n_0}$ of $\theta$ after seeing $x$ is strongly peaked around the
frequency estimate $\hat\theta = \frac{n_1}{n}$ for large $n$. Laplace asked for the predictive probability
$\xi(1|x)$ of observing $x_{n+1} = 1$ after having seen $x = x_1...x_n$, which is $\xi(1|x) = \frac{\xi(x1)}{\xi(x)} = \frac{n_1+1}{n+2}$.
(Laplace believed that the sun had risen for 5000 years = 1826213 days since
creation, so he concluded that the probability of doom, i.e. that the sun won't rise
tomorrow is $\frac{1}{1826215}$.) This looks like a reasonable estimate, since it is close to the
relative frequency, asymptotically consistent, symmetric, even defined for $n = 0$, and
not overconfident (never assigns probability 1).
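
Laplace's rule follows from the evidence formula by a one-line ratio, which can be checked in exact arithmetic. A minimal sketch, with illustrative function names that are not from the text:

```python
from fractions import Fraction
from math import factorial

def evidence(n1, n0):
    """xi(x) = n1! n0! / (n+1)! under the uniform prior density w(theta) = 1."""
    return Fraction(factorial(n1) * factorial(n0), factorial(n1 + n0 + 1))

def laplace_rule(n1, n0):
    """Predictive probability xi(1|x) = xi(x1) / xi(x), from the evidence ratio."""
    return evidence(n1 + 1, n0) / evidence(n1, n0)

def laplace_closed_form(n1, n0):
    """Closed form (n1 + 1) / (n + 2) of the same predictive probability."""
    return Fraction(n1 + 1, n1 + n0 + 2)

if __name__ == "__main__":
    print(laplace_rule(3, 2), laplace_closed_form(3, 2))
    # Laplace's sunrise example: 1826213 sunrises observed, none missed.
    print(1 - laplace_closed_form(1826213, 0))
```

The closed form is used for the sunrise example because evaluating factorials of that size would be needlessly expensive.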

The problem of zero prior. But also Laplace's rule is not without problems.
The appropriateness of the uniform prior has been questioned in Section 3 and will
be detailed below. Here we discuss a version of the zero prior problem. If the
prior is zero, then the posterior is necessarily also zero. The above example seems
unproblematic, since the prior and posterior densities $w(\theta)$ and $w(\theta|x)$ are non-zero.
Nevertheless it is problematic e.g. in the context of scientific confirmation theory
[Ear93].

Consider the hypothesis $H$ that all balls in some urn, or all ravens, are black
(=1). A natural model is to assume that balls/ravens are drawn randomly from an
infinite population with fraction $\theta$ of black balls/ravens and to assume a uniform
prior over $\theta$, i.e. just the Bayes-Laplace model. Now we draw $n$ objects and observe
that they are all black.

We may formalize $H$ as the hypothesis $H' := \{\theta = 1\}$. Although the posterior
probability of the relaxed hypothesis $H_\varepsilon := \{\theta \ge 1-\varepsilon\}$,
$P[H_\varepsilon|1^n] = \int_{1-\varepsilon}^1 w(\theta|1^n)\,d\theta = \int_{1-\varepsilon}^1 (n+1)\theta^n\,d\theta = 1-(1-\varepsilon)^{n+1}$
tends to 1 for $n\to\infty$ for every fixed $\varepsilon > 0$, $P[H'|1^n] =
P[H_0|1^n]$ remains identically zero, i.e. no amount of evidence can confirm $H'$. The
reason is simply that zero prior $P[H'] = 0$ implies zero posterior.

Note that $H'$ refers to the unobservable quantity $\theta$ and only demands blackness
with probability 1. So maybe a better formalization of $H$ is purely in terms of
observational quantities: $H'' := \{\omega_{1:\infty} = 1^\infty\}$. Since $\xi(1^n) = \frac{1}{n+1}$, the predictive probability
of observing $k$ further black objects is $\xi(1^k|1^n) = \frac{\xi(1^{n+k})}{\xi(1^n)} = \frac{n+1}{n+k+1}$. While for fixed $k$
this tends to 1, $P[H''|1^n] = \lim_{k\to\infty}\xi(1^k|1^n) \equiv 0$ $\forall n$, as for $H'$.

One may speculate that the crux is the infinite population. But for a finite
population of size $N$ and sampling with (similarly without) repetition,
$P[H''|1^n] = \xi(1^{N-n}|1^n) = \frac{n+1}{N+1}$ is close to one only if a large fraction of objects has been observed.
This contradicts scientific practice: Although only a tiny fraction of all existing
ravens have been observed, we regard this as sufficient evidence for believing strongly
in $H$.
There are two solutions of this problem: We may abandon strict/logical/all-quantified/universal
hypotheses altogether in favor of soft hypotheses like $H_\varepsilon$.
Although not unreasonable, this approach is unattractive for several reasons. The other
solution is to assign a non-zero prior to $\theta = 1$. Consider, for instance, the improper
density $w(\theta) = \frac{1}{2}[1+\delta(1-\theta)]$, where $\delta$ is the Dirac-delta ($\int f(\theta)\delta(\theta-a)\,d\theta = f(a)$), or
equivalently $P[\theta \ge a] = 1-\frac{1}{2}a$. We get $\xi(x_{1:n}) = \frac{1}{2}[\frac{n_1!n_0!}{(n+1)!} + \delta_{0n_0}]$, where $\delta_{ij}$ (equal to 1 if $i = j$ and 0 else)
is Kronecker's $\delta$. In particular $\xi(1^n) = \frac{1}{2}\frac{n+2}{n+1}$ is much larger than for uniform prior.
Since $\xi(1^k|1^n) = \frac{n+k+2}{n+k+1}\cdot\frac{n+1}{n+2}$, we get $P[H''|1^n] = \lim_{k\to\infty}\xi(1^k|1^n) = \frac{n+1}{n+2} \to 1$, i.e. $H''$ gets
strongly confirmed by observing a reasonable number of black objects. This correct
asymptotics also follows from the general result (3). Confirmation of $H''$ is also
reflected in the fact that $\xi(0|1^n) = \frac{1}{(n+2)^2}$ tends much faster to zero than for uniform
prior, i.e. the confidence that the next object is black is much higher. The power
actually depends on the shape of $w(\theta)$ around $\theta = 1$. Similarly $H'$ gets confirmed:
$P[H'|1^n] = \nu_1(1^n)P[\theta = 1]/\xi(1^n) = \frac{n+1}{n+2} \to 1$. On the other hand, if a single (or more) 0
are observed ($n_0 > 0$), then the predictive distribution $\xi(\cdot|x)$ and posterior $w(\theta|x)$
are the same as for uniform prior.
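
The closed forms for the prior with a point mass at $\theta = 1$ can be re-derived mechanically from the evidence. A minimal sketch in exact arithmetic, with hypothetical function names:

```python
from fractions import Fraction
from math import factorial

def xi_mixed(n1, n0):
    """Evidence for w(theta) = (1/2)[1 + delta(1 - theta)]:
    xi(x) = (1/2)[n1! n0!/(n+1)! + delta_{0,n0}]."""
    uniform_part = Fraction(factorial(n1) * factorial(n0), factorial(n1 + n0 + 1))
    point_mass_part = 1 if n0 == 0 else 0     # theta = 1 only generates ones
    return Fraction(1, 2) * (uniform_part + point_mass_part)

def predict_k_ones(k, n):
    """xi(1^k | 1^n) = xi(1^(n+k)) / xi(1^n)."""
    return xi_mixed(n + k, 0) / xi_mixed(n, 0)

def predict_zero(n):
    """xi(0 | 1^n) = xi(1^n 0) / xi(1^n)."""
    return xi_mixed(n, 1) / xi_mixed(n, 0)

def posterior_theta_one(n):
    """P[H' | 1^n] = nu_1(1^n) P[theta = 1] / xi(1^n), where nu_1(1^n) = 1."""
    return Fraction(1, 2) / xi_mixed(n, 0)

if __name__ == "__main__":
    n = 10
    print(xi_mixed(n, 0), predict_zero(n), posterior_theta_one(n))
    print(float(predict_k_ones(2000, n)))     # approaches (n+1)/(n+2)
```

With only ten black observations the strict hypothesis already has posterior 11/12, in contrast to the identically zero posterior under the uniform prior.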

The findings above remain qualitatively valid for i.i.d. processes over finite
non-binary alphabet $|\mathcal{X}| > 2$ and for non-uniform prior.

Surely to get a generally working setup, we should also assign a non-zero prior
to $\theta = 0$ and to all other "special" $\theta$, like $\frac{1}{2}$ and $\frac{1}{6}$, which may naturally appear
in a hypothesis, like "is the coin or die fair". The natural continuation of this
thought is to assign non-zero prior to all computable $\theta$. This is another motivation
for the universal prior $w_\theta^U = 2^{-K(\theta)}$ (8) constructed in Section 3. It is difficult but
not impossible to operate with such a prior [PH04]. One may want to mix the
discrete prior with a continuous (e.g. uniform) prior density, so that the set of
non-computable $\theta$ keeps a non-zero density. Although possible, we will see that this
is actually not necessary.

Reparametrization invariance. Naively, the uniform prior is justified by the
indifference principle, but as discussed in Section 3, uniformity is not reparametrization
invariant. For instance if in our Bernoulli example we introduce a new
parametrization $\theta' = \sqrt{\theta}$, then the $\theta'$-density $w'(\theta') = 2\sqrt{\theta}\,w(\theta)$ is no longer uniform
if $w(\theta) = 1$ is uniform.

More generally, assume we have some principle which leads to some prior $w(\theta)$.
Now we apply the principle to a different parametrization $\theta'\in\Theta'$ and get prior $w'(\theta')$.
Assume that $\theta$ and $\theta'$ are related via bijection $\theta = f(\theta')$. Another way to get a $\theta'$-prior
is to transform the $\theta$-prior $w(\theta) \leadsto \tilde{w}(\theta')$. The reparametrization invariance principle
(RIP) states that $w'$ should be equal to $\tilde{w}$.

For discrete $\Theta$, simply $\tilde{w}_{\theta'} = w_{f(\theta')}$, and a uniform prior remains uniform
($w'_{\theta'} = \tilde{w}_{\theta'} = w_\theta = \frac{1}{|\Theta|}$) in any parametrization, i.e. the indifference principle satisfies RIP in
finite model classes.

In case of densities, we have $\tilde{w}(\theta') = w(f(\theta'))\frac{df(\theta')}{d\theta'}$, and the indifference principle
violates RIP for non-linear transformations $f$. But Jeffreys' and Bernardo's principle
satisfy RIP. For instance, in the Bernoulli case we have $\bar{j}_n(\theta) = \frac{1}{\theta}+\frac{1}{1-\theta}$, hence $w(\theta) =
\frac{1}{\pi}[\theta(1-\theta)]^{-1/2}$ and $w'(\theta') = \frac{1}{\pi}[f(\theta')(1-f(\theta'))]^{-1/2}\frac{df(\theta')}{d\theta'} = \tilde{w}(\theta')$.

Does the universal prior $w_\theta^U = 2^{-K(\theta)}$ satisfy RIP? If we apply the "universality
principle" to a $\theta'$-parametrization, we get $w'^U_{\theta'} = 2^{-K(\theta')}$. On the other
hand, $w_\theta$ simply transforms to $\tilde{w}^U_{\theta'} = w^U_{f(\theta')} = 2^{-K(f(\theta'))}$ ($w_\theta$ is a discrete (non-density)
prior, which is non-zero on a discrete subset of $\mathcal{M}$). For computable $f$
we have $K(f(\theta')) \stackrel{+}{\le} K(\theta') + K(f)$ by (7d), and similarly $K(f^{-1}(\theta)) \stackrel{+}{\le} K(\theta) + K(f)$
if $f$ is invertible. Hence for simple bijections $f$, i.e. for $K(f) = O(1)$, we have
$K(f(\theta')) \stackrel{+}{=} K(\theta')$, which implies $w'^U_{\theta'} \stackrel{\times}{=} \tilde{w}^U_{\theta'}$, i.e. the universal prior satisfies RIP w.r.t.
simple transformations $f$ (within a multiplicative constant).
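
For densities, the difference between the indifference principle and Jeffreys' principle under the reparametrization $\theta = f(\theta') = \theta'^2$ can be checked numerically. This is a sketch under the assumption that Jeffreys' prior in the $\theta'$-parametrization is obtained from the Fisher information of the Bernoulli model with success probability $\theta'^2$, which works out to $4/(1-\theta'^2)$; the function names are illustrative.

```python
import math

def f(tp):
    """Reparametrization theta = f(theta') = theta'^2, the inverse of theta' = sqrt(theta)."""
    return tp * tp

def df(tp):
    """Derivative df(theta')/dtheta'."""
    return 2 * tp

def transformed(w, tp):
    """Transformed density: w~(theta') = w(f(theta')) * df(theta')/dtheta'."""
    return w(f(tp)) * df(tp)

def uniform(theta):
    """Indifference principle: uniform density on [0, 1]."""
    return 1.0

def jeffreys(theta):
    """Jeffreys' prior in the theta-parametrization: (1/pi) [theta (1-theta)]^(-1/2)."""
    return 1 / (math.pi * math.sqrt(theta * (1 - theta)))

def jeffreys_in_theta_prime(tp):
    """Jeffreys' principle applied directly in the theta'-parametrization: the square
    root of the Fisher information 4/(1 - theta'^2), normalised on [0, 1] (integral pi)."""
    return 2 / (math.pi * math.sqrt(1 - tp * tp))

if __name__ == "__main__":
    for tp in (0.1, 0.5, 0.9):
        print(tp, transformed(uniform, tp), uniform(tp),
              transformed(jeffreys, tp), jeffreys_in_theta_prime(tp))
```

The transformed uniform density $2\theta'$ differs from the uniform density the indifference principle would assign in the new parametrization, whereas the two Jeffreys columns agree.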

Regrouping invariance. There are important transformations $f$ which are not
bijections, which we consider in the following. A simple non-bijection is $\theta = f(\theta') =
\theta'^2$ if we consider $\theta'\in[-1,1]$. More interesting is the following example: Assume
we had decided not to record blackness versus non-blackness of objects, but their
"color". For simplicity of exposition assume we record only whether an object is
black or white or colored, i.e. $\mathcal{X}' = \{B,W,C\}$. In analogy to the binary case we
use the indifference principle to assign a uniform prior on $\theta'\in\Theta' := \Delta_3$, where
$\Delta_d := \{\theta'\in[0,1]^d : \sum_{i=1}^d \theta'_i = 1\}$, and $\nu'_{\theta'}(x'_{1:n}) = \prod_i \theta'^{\,n_i}_i$. All inferences regarding
blackness (predictive and posterior) are identical to the binomial model $\nu_\theta(x_{1:n}) =
\theta^{n_1}(1-\theta)^{n_0}$ with $x'_t = B \leadsto x_t = 1$ and $x'_t = W$ or $C \leadsto x_t = 0$ and $\theta = f(\theta') = \theta'_B$ and
$w(\theta) = \int_{\Delta_3} w'(\theta')\delta(\theta'_B-\theta)\,d\theta'$. Unfortunately, for uniform prior $w'(\theta') \propto 1$, $w(\theta) \propto 1-\theta$
is not uniform, i.e. the indifference principle is not invariant under splitting/grouping,
or general regrouping. Regrouping invariance is regarded as a very important and
desirable property [Wal96].

We now consider general i.i.d. processes $\nu_\theta(x) = \prod_{i=1}^d \theta_i^{n_i}$. Dirichlet priors
$w(\theta) \propto \prod_{i=1}^d \theta_i^{\alpha_i-1}$ form a natural conjugate class ($w(\theta|x) \propto \prod_{i=1}^d \theta_i^{n_i+\alpha_i-1}$) and are
the default priors for multinomial (i.i.d.) processes over finite alphabet $\mathcal{X}$ of size
$d$. Note that $\xi(a|x) = \frac{n_a+\alpha_a}{n+\alpha_1+...+\alpha_d}$ generalizes Laplace's rule and coincides with
Carnap's [Ear93] confirmation function. Symmetry demands $\alpha_1 = ... = \alpha_d$; for instance
$\alpha \equiv 1$ for uniform and $\alpha \equiv \frac{1}{2}$ for Bernardo-Jeffreys' prior. Grouping two "colors" $i$
and $j$ results in a Dirichlet prior with $\alpha_{i\&j} = \alpha_i+\alpha_j$ for the group. The only way
to respect symmetry under all possible groupings is to set $\alpha \equiv 0$. This is Haldane's
improper prior, which results in unacceptably overconfident predictions $\xi(1|1^n) = 1$.
Walley [Wal96] solves the problem that there is no single acceptable prior density
by considering sets of priors.
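
The failure of regrouping invariance for symmetric Dirichlet priors shows up directly in the predictive rule. The sketch below uses hypothetical counts for the black/white/colored example; `dirichlet_predict` and `group` are illustrative helper names.

```python
from fractions import Fraction

def dirichlet_predict(symbol, counts, alpha):
    """xi(a|x) = (n_a + alpha_a) / (n + alpha_1 + ... + alpha_d)."""
    n = sum(counts.values())
    return Fraction(counts[symbol] + alpha[symbol]) / Fraction(n + sum(alpha.values()))

def group(values, merged, name):
    """Merge the entries listed in `merged` into one entry `name` by summing them."""
    out = {k: v for k, v in values.items() if k not in merged}
    out[name] = sum(values[k] for k in merged)
    return out

if __name__ == "__main__":
    counts3 = {"B": 5, "W": 2, "C": 1}                 # hypothetical observations
    uniform3 = {"B": 1, "W": 1, "C": 1}                # alpha = 1 on three colors
    counts2 = group(counts3, ("W", "C"), "N")          # black versus non-black
    grouped_alpha = group(uniform3, ("W", "C"), "N")   # alpha_{W&C} = 1 + 1 = 2
    uniform2 = {"B": 1, "N": 1}                        # alpha = 1 on two classes

    print(dirichlet_predict("B", counts3, uniform3))       # three-color model
    print(dirichlet_predict("B", counts2, grouped_alpha))  # same model after grouping
    print(dirichlet_predict("B", counts2, uniform2))       # Laplace on the binary model
    # Haldane's alpha = 0 is grouping-consistent but overconfident:
    print(dirichlet_predict("B", {"B": 4, "N": 0}, {"B": 0, "N": 0}))
```

Grouping preserves the prediction only if the grouped prior is allowed to become asymmetric; insisting on symmetry in both descriptions gives different answers for the same data.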

We now show that the universal prior $w_\theta^U = 2^{-K(\theta)}$ is invariant under regrouping,
and more generally under all simple (computable with complexity $O(1)$) even
non-bijective transformations. Consider prior $w'_{\theta'}$. If $\theta = f(\theta')$ then $w'_{\theta'}$ transforms to
$\tilde{w}_\theta = \sum_{\theta':f(\theta')=\theta} w'_{\theta'}$ (note that for non-bijections there is more than one $w'_{\theta'}$ consistent
with $\tilde{w}_\theta$). In $\theta'$-parametrization, the universal prior reads $w'^U_{\theta'} = 2^{-K(\theta')}$. Using
(7f) with $x = \langle\theta'\rangle$ and $y = \langle\theta\rangle$ we get
$\tilde{w}^U_\theta = \sum_{\theta':f(\theta')=\theta} 2^{-K(\theta')} \stackrel{\times}{=} 2^{-K(\theta)} = w^U_\theta$, i.e. the
universal prior is general transformation and hence regrouping invariant (within a
multiplicative constant) w.r.t. simple computable transformations $f$.

Note that reparametrization and regrouping invariance hold for arbitrary classes
$\mathcal{M}$ and are not limited to the i.i.d. case.



5 Universal Sequence Prediction 

Universal choice of $\mathcal{M}$. The bounds of Section 2 apply if $\mathcal{M}$ contains the true
environment $\mu$. The larger $\mathcal{M}$ the less restrictive is this assumption. The class of all
computable distributions, although only countable, is pretty large from a practical
point of view, since it includes for instance all of today's valid physics theories. It
is the largest class, relevant from a computational point of view. Solomonoff [Sol64,
Eq.(13)] defined and studied the mixture over this class.

One problem is that this class is not enumerable, since the class of computable
functions $f : \mathcal{X}^*\to\mathbb{R}$ is not enumerable (halting problem), nor is it decidable whether
a function is a measure. Hence $\xi$ is completely incomputable. Levin [ZL70] had the
idea to "slightly" extend the class and include also lower semi-computable semimeasures.
One can show that this class $\mathcal{M}_U = \{\nu_1,\nu_2,...\}$ is enumerable, hence

$\xi_U(x) = \sum_{\nu\in\mathcal{M}_U} w_\nu^U\nu(x)$ (10)

is itself lower semi-computable, i.e. $\xi_U\in\mathcal{M}_U$, which is a convenient property in
itself. Note that since $\frac{1}{n\log^2 n} \stackrel{\times}{\le} w^U_{\nu_n} \le \frac{1}{n}$ for most $n$ by (7b) and (7c), most $\nu_n$ have
prior approximately reciprocal to their index $n$.

In some sense $\mathcal{M}_U$ is the largest class of environments for which $\xi$ is in some
sense computable [Hut04], but see [Sch02] for even larger classes.

The problem of old evidence. An important problem in Bayesian inference
in general and (Bayesian) confirmation theory [Ear93] in particular is how to deal
with 'old evidence' or equivalently with 'new theories'. How shall a Bayesian treat
the case when some evidence $E \,\hat{=}\, x$ (e.g. Mercury's perihelion advance) is known
well before the correct hypothesis/theory/model $H \,\hat{=}\, \mu$ (Einstein's general relativity
theory) is found? How shall $H$ be added to the Bayesian machinery a posteriori?
What is the prior of $H$? Should it be the belief in $H$ in a hypothetical counterfactual
world in which $E$ is not known? Can old evidence $E$ confirm $H$? After all, $H$ could
simply be constructed/biased/fitted towards "explaining" $E$.

The universal class $\mathcal{M}_U$ and universal prior $w_\nu^U$ formally solve this problem:
The universal prior of $H$ is $2^{-K(H)}$. This is independent of $\mathcal{M}$ and of whether
$E$ is known or not. If we use $E$ to construct $H$ or fit $H$ to explain $E$, this will
lead to a theory which is more complex ($K(H) \stackrel{+}{\ge} K(E)$) than a theory from scratch
($K(H) = O(1)$), so cheats are automatically penalized. There is no problem of adding
hypotheses to $\mathcal{M}$ a posteriori. Priors of old hypotheses are not affected. Finally,
$\mathcal{M}_U$ includes all hypotheses (including yet unknown or unnamed ones) a priori. So
at least theoretically, updating $\mathcal{M}$ is unnecessary.

Other representations of $\xi_U$. There is a much more elegant representation of
$\xi_U$: Solomonoff [Sol64, Eq.(7)] defined the universal prior $M(x)$ as the probability
that the output of a universal Turing machine $U$ starts with $x$ when provided with
fair coin flips on the input tape. Note that a uniform distribution is also used in the
so-called No-Free-Lunch theorems to prove the impossibility of universal learners,
but in our case the uniform distribution is piped through a universal Turing machine,
which defeats these negative implications. Formally, $M$ can be defined as

$$M(x) := \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)} \qquad (11)$$

where the sum is over all (so-called minimal) programs $p$ for which $U$ outputs a
string starting with $x$. $M$ may be regarded as a $2^{-\ell(p)}$-weighted mixture over all
computable deterministic environments $\nu_p$ ($\nu_p(x) = 1$ if $U(p) = x*$ and $0$ else). Now,
as a positive surprise, $M(x)$ coincides with $\xi_U(x)$ within an irrelevant multiplicative
constant. So it is actually sufficient to consider the class of deterministic
semimeasures. The reason is that the probabilistic semimeasures are in the convex hull of
the deterministic ones, and so need not be taken into account separately in the mixture.
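The sum defining $M$ can be made concrete with a small sketch. Everything in it is an assumption for illustration and not part of the paper: the universal machine $U$ is replaced by a toy, non-universal machine whose programs have the prefix-free form `1^k 0` followed by a `k`-bit pattern, which the machine prints forever, and the enumeration is truncated at `MAX_PATTERN_LEN`. The names `programs`, `toy_output`, `toy_M` and `predict` are hypothetical.

```python
from fractions import Fraction
from itertools import product

MAX_PATTERN_LEN = 8  # assumption: truncate the infinite program enumeration


def programs(max_len=MAX_PATTERN_LEN):
    """Yield (program, pattern) pairs of a toy prefix-free language.

    A program is 1^k 0 followed by a k-bit pattern; the toy machine
    prints that pattern forever. This is NOT a universal machine.
    """
    for k in range(1, max_len + 1):
        for bits in product("01", repeat=k):
            pattern = "".join(bits)
            yield "1" * k + "0" + pattern, pattern


def toy_output(pattern, n):
    """First n output symbols of the toy machine for this pattern."""
    return (pattern * (n // len(pattern) + 1))[:n]


def toy_M(x, max_len=MAX_PATTERN_LEN):
    """Sum of 2^-len(p) over all toy programs whose output starts with x."""
    total = Fraction(0)
    for program, pattern in programs(max_len):
        if toy_output(pattern, len(x)) == x:
            total += Fraction(1, 2 ** len(program))
    return total


def predict(x, symbol):
    """Conditional probability toy_M(symbol | x) = toy_M(x + symbol) / toy_M(x)."""
    return toy_M(x + symbol) / toy_M(x)
```

In this truncated toy, `predict("1" * 10, "1")` is exactly 1, because the only surviving programs print nothing but ones. For the real $M$ the conditional probability only converges to 1, and $M$ itself is merely lower semi-computable.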

Bounds for computable environments. The earlier bound surely is applicable for
$\xi = \xi_U$ and now holds for any computable measure $\mu$. Within an additive constant
the bound is also valid for $M \stackrel{\times}{=} \xi_U$. That is, $\xi_U$ and $M$ are excellent predictors
with the only condition that the sequence is drawn from any computable probability
distribution. The corresponding error bound shows that the total number of prediction
errors is small. Similarly, one can show that
$\sum_{t=1}^{n} |1 - M(x_t|x_{<t})| \le Km(x_{1:n}) \ln 2$, where the
monotone complexity $Km(x) := \min\{\ell(p) : U(p) = x*\}$ is defined as the length of
the shortest (nonhalting) program computing a string starting with $x$ [ZL70, LV97, Hut04].

If $x_{1:\infty}$ is a computable sequence, then $Km(x_{1:\infty})$ is finite, which implies
$M(x_t|x_{<t}) \to 1$ on every computable sequence. This means that if the environment
is a computable sequence (whichsoever, e.g. $1^\infty$ or the digits of $\pi$ or $e$), after having
seen the first few digits, $M$ correctly predicts the next digit with high probability,
i.e. it recognizes the structure of the sequence. In particular, observing an increasing
number of black balls or black ravens or sunrises, $M$ becomes rapidly confident
that future balls and ravens are black and that the sun will rise tomorrow:
$M(1|1^n) \to 1$, since $Km(1^\infty) = O(1)$.
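The finiteness of the total prediction error can be checked numerically on a toy class. The sketch below is an illustration under stated assumptions, not the paper's construction: the computable sequences are replaced by binary periodic patterns of period at most `max_period`, and a pattern of period $k$ gets the hypothetical prior weight $2^{-(2k+1)}$, which stands in for $2^{-Km}$. For such a deterministic mixture $\xi$, the sum $\sum_t (1 - \xi(x_t|x_{<t}))$ is at most $\ln(1/w)$ for the true pattern's weight $w$, which is the toy analogue of $Km(x_{1:n}) \ln 2$.

```python
import math


def make_class(max_period=6):
    """All binary periodic patterns up to max_period, with weight 2^-(2k+1).

    The weights are an assumption standing in for 2^-Km; they sum to less
    than 1, so the mixture is a semimeasure.
    """
    hyps = []
    for k in range(1, max_period + 1):
        for i in range(2 ** k):
            pattern = format(i, f"0{k}b")
            hyps.append((pattern, 2.0 ** -(2 * k + 1)))
    return hyps


def cumulative_error(sequence, hyps):
    """Return sum_t (1 - xi(x_t | x_<t)) for the mixture xi over hyps."""
    alive = list(hyps)  # hypotheses consistent with the data seen so far
    total = 0.0
    for t, symbol in enumerate(sequence):
        mass = sum(w for _, w in alive)
        agree = [(p, w) for p, w in alive if p[t % len(p)] == symbol]
        total += 1.0 - sum(w for _, w in agree) / mass
        alive = agree  # Bayes update: drop the refuted hypotheses
    return total


def error_bound(pattern):
    """ln(1/w) for the true pattern: the toy analogue of Km(x) * ln 2."""
    return (2 * len(pattern) + 1) * math.log(2)
```

For example, `cumulative_error("011" * 20, make_class())` stays below `error_bound("011")` however long the sequence is made, because every hypothesis that disagrees with the data is eliminated after finitely many symbols. The sequence must be generated by a pattern in the class; otherwise the posterior mass reaches zero and the division fails.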

Universal is better than continuous $\mathcal{M}$. Although we argued that incomputable
environments $\mu$ can safely be ignored, one may be nevertheless uneasy using
Solomonoff's $M \stackrel{\times}{=} \xi_U$ (11) if outperformed by a continuous mixture $\xi$ on such
$\mu \in \mathcal{M} \setminus \mathcal{M}_U$, for instance if $M$ would fail to predict a Bernoulli($\theta$) sequence for
incomputable $\theta$. Luckily this is not the case: Although $\nu_\theta(\cdot)$ and $w_\theta$ can be
incomputable, the studied classes $\mathcal{M}$ themselves, i.e. the two-argument function $\nu_{(\cdot)}(\cdot)$, and
the weight function $w_{(\cdot)}$, and hence $\xi(\cdot)$, are typically computable (the integral can
be approximated to arbitrary precision). Hence $M(x) \stackrel{\times}{=} \xi_U(x) \ge 2^{-K(\xi)} \xi(x)$ by (10)
and $K(\xi)$ is often quite small. This implies for all $\mu$

$$D_n(\mu\|M) = \mathbf{E}\Big[\ln\frac{\mu(x_{1:n})}{M(x_{1:n})}\Big] = \mathbf{E}\Big[\ln\frac{\mu(x_{1:n})}{\xi(x_{1:n})}\Big] + \mathbf{E}\Big[\ln\frac{\xi(x_{1:n})}{M(x_{1:n})}\Big] \stackrel{+}{\le} D_n(\mu\|\xi) + K(\xi)\ln 2$$
So any bound for $D_n(\mu\|\xi)$ is directly valid also for $D_n(\mu\|M)$, save an additive
constant. That is, $M$ is superior (or equal) to all computable mixture predictors
$\xi$ based on any (continuous or discrete) model class $\mathcal{M}$ and weight $w(\theta)$, even if
environment $\mu$ is not computable. Furthermore, while for essentially all parametric
classes, $D_n(\mu\|\xi) \sim \frac{d}{2}\ln n$ grows logarithmically in $n$ for all (incl. computable) $\mu \in \mathcal{M}$,
$D_n(\mu\|M) \stackrel{+}{\le} K(\mu)\ln 2$ is finite for computable $\mu$. Bernardo's prior even implies a
bound for $M$ that is uniform (minimax) in $\theta \in \Theta$. Many other priors based on
reasonable principles (see the earlier sections and [KW96]) and many other computable
probabilistic predictors $\rho$ are argued for. The above actually shows that $M$ is superior
to all of them.
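The dominance argument can be checked exactly on a small stand-in, sketched below under assumptions that are not from the paper. The continuous mixture $\xi$ is taken to be the uniform-prior Bernoulli mixture, $\xi(x_{1:n}) = k!(n-k)!/(n+1)!$ for a sequence with $k$ ones. The role of $M$ is played by a hypothetical finite mixture `log_big_mixture` that contains $\xi$ with weight `w_xi`, so that $\ln(1/w_\xi)$ takes the place of $K(\xi)\ln 2$. The other weights and parameters are arbitrary illustrative choices.

```python
import math


def log_binom(n, k):
    """ln of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_bernoulli(theta, n, k):
    """ln probability of one specific sequence with k ones out of n."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)


def log_laplace(n, k):
    """ln xi(x) for the uniform-prior Bernoulli mixture: k!(n-k)!/(n+1)!."""
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)


def log_big_mixture(n, k, w_xi=0.25, others=((0.25, 0.5), (0.5, 0.1))):
    """ln of w_xi * xi(x) + sum_j w_j * Bernoulli(theta_j)(x).

    `others` holds (weight, theta) pairs; all values are illustrative.
    """
    terms = [math.log(w_xi) + log_laplace(n, k)]
    terms += [math.log(w) + log_bernoulli(th, n, k) for w, th in others]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def relative_entropy(theta, n, log_predictor):
    """D_n(mu || rho) = E_mu[ln mu(x_1:n) / rho(x_1:n)], mu = Bernoulli(theta)."""
    total = 0.0
    for k in range(n + 1):
        log_mu = log_bernoulli(theta, n, k)
        prob_k = math.exp(log_binom(n, k) + log_mu)  # mu-probability of k ones
        total += prob_k * (log_mu - log_predictor(n, k))
    return total
```

With `theta = 1 / math.pi`, which is not a component of the larger mixture, `relative_entropy(theta, n, log_big_mixture)` never exceeds `relative_entropy(theta, n, log_laplace) + math.log(4)`. With `theta = 0.1`, which is a component with weight 0.5, the larger mixture's relative entropy stays below `math.log(2)` for every `n`, while that of the continuous mixture keeps growing logarithmically.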

6 Discussion 

Critique and problems. In practice we often have extra information about the
problem at hand, which could and should be used to guide the forecasting. One
way is to explicate all our prior knowledge $y$ and place it on an extra input tape of
our universal Turing machine $U$, which leads to the conditional complexity $K(\cdot|y)$.
We now assign "subjective" prior $w^U_{\nu|y} = 2^{-K(\nu|y)}$ to environment $\nu$, which is large
for those $\nu$ that are simple (have short description) relative to our background
knowledge $y$. Since $K(\mu|y) \stackrel{+}{\le} K(\mu)$, extra knowledge never misguides (see the bounds above).
Alternatively we could prefix our observation sequence $x$ by $y$ and use $M(yx)$ for
prediction [Hut04].

Another critique concerns the dependence of K and M on U. Predictions for 
short sequences x (shorter than typical compiler lengths) can be arbitrary. But 
taking into account our (whole) scientific prior knowledge y, and predicting the now 
long string yx leads to good (less sensitive to "reasonable" U) predictions [Hut04].

Finally, K and M can serve as "gold standards" which practitioners should aim
at, but since they are only semi-computable, they have to be (crudely) approximated
in practice. Levin complexity [LV97], Schmidhuber's speed prior, the minimal
message and description length principles [Wal05], and off-the-shelf compressors
like Lempel-Ziv are such approximations, which have been successfully applied to a
plethora of problems [CV05, Sch04].
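As a rough illustration of the compressor route, the sketch below uses Python's standard `zlib` module (a Lempel-Ziv-style DEFLATE compressor) as a crude stand-in for $K$. This is an assumption made for illustration, not a method from the paper, and the helper names `compressed_bits` and `simpler_continuation` are hypothetical. Among several candidate continuations of a history, it prefers the one whose concatenation with the history compresses best.

```python
import zlib


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed data; a crude upper proxy for K."""
    return 8 * len(zlib.compress(data, 9))


def simpler_continuation(history: bytes, candidates):
    """Return the candidate whose concatenation with history compresses best.

    Ties are broken in favour of the earlier candidate. Very short
    candidates may tie, because the compressed size is rounded to bytes.
    """
    return min(candidates, key=lambda c: compressed_bits(history + c))
```

For instance, after the history `b"0123456789" * 30`, the continuation `b"0123456789" * 3` is preferred over an irregular digit string of the same length. Such a compressor only detects repeated substrings, so it misses most of the regularities that $K$ and $M$ capture.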

Summary. We compared traditional Bayesian sequence prediction based on continuous
classes and prior densities to Solomonoff's universal predictor $M$, prior $w^U_\nu$,
and class $\mathcal{M}_U$. We discussed: Convergence for generic class and prior, the relative
entropy bound for continuous classes, indifference/symmetry principles, the problem
of zero p(oste)rior and confirmation of universal hypotheses, reparametrization and
regrouping invariance, the problem of old evidence and updating, that $M$ works even
in non-computable environments, how to incorporate prior knowledge, the prediction
of short sequences, the constant fudges in all results and the $U$-dependence, $M$'s
incomputability and crude but practical approximations. In short, universal prediction
solves or avoids or meliorates many foundational and philosophical problems,
but has to be compromised in practice.